September 9, 2015

The Joy of Painting

Goals

  • cluster paintings into groups containing similar things
  • name the groups
  • “average” Bob Ross paintings
  • avoid watching 400 episodes of The Joy of Painting

Data Collection

  • downloaded 222 .jpg files from saleoilpaintings.com (using R)
  • rescaled paintings to 20 x 20 pixels (using Python)

becomes

A bit bigger…
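The rescaling step described above can be sketched in plain Python. The slides do not say which library or resampling method was used, so block averaging is an assumption here, and `downscale` is a hypothetical helper that works on nested lists of (r, g, b) tuples rather than on real image files.

```python
def downscale(pixels, out_h=20, out_w=20):
    """Shrink an image (rows of (r, g, b) tuples) by averaging rectangular blocks."""
    in_h, in_w = len(pixels), len(pixels[0])
    small = []
    for i in range(out_h):
        # Source rows covered by output row i (always at least one row).
        r0 = i * in_h // out_h
        r1 = max((i + 1) * in_h // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            # Source columns covered by output column j.
            c0 = j * in_w // out_w
            c1 = max((j + 1) * in_w // out_w, c0 + 1)
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            # Average each color channel over the block.
            row.append(tuple(round(sum(p[k] for p in block) / len(block)) for k in range(3)))
        small.append(row)
    return small
```

Every output pixel is the mean color of the source block it covers, which is why the small version looks like a blurred mosaic of the original.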

Turning Paintings into Data

  • Convert .jpg to .txt files (using ImageMagick)
    • (r,g,b) color values for each pixel are variables
    • \(3 \times 20 \times 20 = 1200\) variables!
    • 222 observations of 1200 variables each
##   id X1r X1g X1b X2r X2g X2b X3r X3g X3b
## 1  1 191 170 175 186 169 159 198 186 160
## 2  2  76  52  74  78  53  74  78  51  68
## 3  3  56  61  65 108 113 117 135 138 143
## 4  4   6 138 177   9 123 175  12  99 168
## 5  5 151 149 154 154 152 157 163 158 162
## 6  6 173 168 130 181 173 136 187 177 141
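To make the layout of the table above concrete, here is a minimal sketch of flattening one painting into a single observation. The function name `painting_to_row` is hypothetical, and numbering pixels row by row is an assumption; the slides only show the resulting column names.

```python
def painting_to_row(painting_id, pixels):
    """Flatten rows of (r, g, b) tuples into one {column name: value} record."""
    row = {"id": painting_id}
    n = 0
    for pixel_row in pixels:
        for r, g, b in pixel_row:
            n += 1  # pixels are numbered 1, 2, 3, ... in reading order (assumed)
            row[f"X{n}r"], row[f"X{n}g"], row[f"X{n}b"] = r, g, b
    return row
```

A 20 x 20 painting yields 400 pixels and therefore 1200 color columns plus the `id` column.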

Principal Component Analysis

  • number of variables (1200) is way bigger than number of observations (222)
  • need dimension reduction
  • principal components analysis on the \(222 \times 1200\) data frame
    • translate each high dimensional point into a lower dimensional space
    • \(1^{st}\) PC orders the points on a number line in 1D, \(1^{st}\) & \(2^{nd}\) PC place the points on a 2D plane, etc.
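The idea that the first PC orders the points on a number line can be illustrated with a small sketch. It uses power iteration, which is a swapped-in technique chosen to keep the example dependency-free; the slides do not say how the components were actually computed. The name `first_principal_component` and its parameters are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first PC of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    # Center each variable at zero.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    v = [1.0 / math.sqrt(p)] * p  # deterministic starting direction
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        # Multiply by X^T X without forming the p x p matrix.
        new_v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0:
            break  # start vector was orthogonal to the data; no update possible
        v = [w / norm for w in new_v]
    # Projecting onto the direction gives each point's position on the number line.
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores
```

The returned scores are the 1D coordinates of the observations. Further components would require removing the first direction from the data and repeating, which this sketch leaves out.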

Clustering Methods

Investigated the following methods on the \(222 \times 222\) reduced-dimension data:

  • Hierarchical clustering
  • Model-Based clustering
  • K-Means clustering

Hierarchical Clustering

Agglomerative hierarchical clustering:

  • Work from the bottom up
  • Each observation starts as its own cluster; clusters then get combined according to a rule
  • Rule = complete linkage: join the 2 clusters whose maximum pairwise distance is smallest
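The bottom-up procedure above can be sketched as follows. This is a simple, unoptimized illustration assuming Euclidean distance; `complete_linkage` and its `n_clusters` stopping rule are hypothetical names, not taken from the slides.

```python
import math

def complete_linkage(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def max_distance(a, b):
        # Complete linkage: distance between clusters is their farthest pair.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        _, a, b = min(
            (max_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Because the farthest pair decides the distance, complete linkage tends to produce compact clusters rather than long chains.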

Model-Based Clustering

Data follow a Gaussian mixture model

  • Clusters form ellipsoid “clouds”
  • Can have different sizes, shapes, and orientations
  • Was too restrictive a method for my purposes
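For reference, a Gaussian mixture model with \(K\) components is usually written as below. The symbols \(\pi_k\), \(\mu_k\), and \(\Sigma_k\) are standard notation and do not appear in the slides: \(\pi_k\) is the weight of cluster \(k\), \(\mu_k\) its center, and \(\Sigma_k\) the covariance matrix that sets the size, shape, and orientation of its ellipsoid.

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
\]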

K-means Clustering

Forms \(K\) groups around \(K\) centers

  • Centers are randomly chosen at first - set a seed
  • Points get assigned to closest center
  • Average points, get new centers
  • Repeat until convergence
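The four steps above map directly onto a short loop. This is a minimal sketch assuming squared Euclidean distance and initial centers drawn from the data points; the function name `kmeans` and the parameters `seed` and `max_iter` are hypothetical.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means loop; returns (centers, labels)."""
    rng = random.Random(seed)  # fixed seed makes the random start reproducible
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels
```

Different seeds can give different groupings, which is why fixing the seed matters for reproducing a result.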

Choosing a K

Number of groups chosen by looking at cluster statistics

Best Model

K-Means, \(K = 20\)

Cluster    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Paintings  6 12  8  8 16 15  4  8 17  6  6  5  6 24 10 11  8 18 19 15

The Groups

  • Looked at the original paintings after grouping
  • Gave the 20 groups names based on their contents
  • Smallest group called “blues and browns”

Average Painting in the Groups

In each group of paintings:

  • Group the paintings by pixel location
  • Average the red, green, and blue color values separately over all paintings in the group
  • Combine these averages into an (r,g,b) color value
  • Plot these color values on their assigned pixel location, creating an average painting
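The averaging steps above can be sketched as a per-pixel, per-channel mean. The representation (each painting as a list of rows of (r, g, b) tuples) and the name `average_painting` are assumptions made for illustration, and rounding to whole color values is a choice of this sketch.

```python
def average_painting(paintings):
    """Average a group of same-sized paintings pixel by pixel.

    Each painting is a list of rows of (r, g, b) tuples.
    """
    n = len(paintings)
    height, width = len(paintings[0]), len(paintings[0][0])
    return [
        [
            tuple(
                round(sum(p[i][j][ch] for p in paintings) / n)
                for ch in range(3)  # red, green, blue
            )
            for j in range(width)
        ]
        for i in range(height)
    ]
```

The result has the same dimensions as the inputs, so it can be plotted like any other 20 x 20 painting.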

Average Paintings (1/5)

Average Paintings (2/5)

Average Paintings (3/5)

Average Paintings (4/5)

Average Paintings (5/5)

* Bright blue skies behind snow-capped mountains near rivers and coniferous trees

Conclusion

  • The average paintings look like the group names
  • Very efficient method
  • Able to see themes in Bob Ross paintings

Acknowledgements

Thank you for listening!